Understanding Inefficiencies in Data-Intensive Computing
Authors
Abstract
New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how such systems perform relative to the capabilities of the hardware on which they run. This paper describes a simple model of I/O resource consumption that predicts the ideal lower-bound runtime of a parallel dataflow on a particular set of hardware. Comparing actual system performance to the model’s ideal prediction exposes the inefficiency of a scale-out system. Using a simplified dataflow processing tool called Parallel DataSeries, we show that the model’s ideal can be approached (i.e., that it is not wildly optimistic), but that a gap of up to 20% remains for workloads using up to 45 nodes. Guided by the model, we analyze inefficiencies exposed in both the disk and networking subsystems—issues that will be faced by any DISC system built atop popular commodity hardware and OSs.

Acknowledgements: We thank the members and companies of the PDL Consortium (including APC, EMC, Facebook, Google, Hewlett-Packard Labs, Hitachi, IBM, Intel, LSI, Microsoft Research, NEC Laboratories, NetApp, Oracle, Riverbed, Samsung, Seagate, STEC, Symantec, VMware, and Yahoo! Labs) for their interest, insights, feedback, and support. This research was sponsored in part by an HP Innovation Research Award and by CyLab at Carnegie Mellon University under grant DAAD19–02–1–0389 from the Army Research Office.
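The abstract's core idea — comparing measured runtime against an ideal lower bound derived from hardware capabilities — can be illustrated with a minimal bottleneck-style model. The function below is a hypothetical sketch, not the paper's actual model; all parameter names and values are illustrative assumptions.

```python
# Hypothetical sketch of a bottleneck-style I/O model: the ideal runtime of a
# parallel dataflow phase is bounded below by the time its most-loaded
# resource (disk or network) needs to move the phase's data.

def ideal_phase_runtime(bytes_read, bytes_written, bytes_shuffled,
                        nodes, disk_bw, net_bw):
    """Lower-bound runtime (seconds) of one dataflow phase on `nodes`
    machines, each with `disk_bw` and `net_bw` in bytes/second."""
    disk_time = (bytes_read + bytes_written) / (nodes * disk_bw)
    net_time = bytes_shuffled / (nodes * net_bw)
    # The phase cannot finish faster than its slowest resource allows.
    return max(disk_time, net_time)

# Illustrative example: process 1 TB on 45 nodes, each with a 100 MB/s disk
# and a ~125 MB/s (1 Gb/s) network; data is read, shuffled, and written once.
tb = 10**12
t = ideal_phase_runtime(bytes_read=tb, bytes_written=tb,
                        bytes_shuffled=tb, nodes=45,
                        disk_bw=100e6, net_bw=125e6)
```

Measuring a real system's runtime for the same workload and dividing by this bound gives an efficiency figure; the paper reports that a simplified tool (Parallel DataSeries) can come within about 20% of such an ideal.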
Similar Resources
Data Replication-Based Scheduling in Cloud Computing Environment
Abstract— High-performance computing and vast storage are two key factors required for executing data-intensive applications. In comparison with traditional distributed systems like data grids, cloud computing provides these factors in a more affordable, scalable, and elastic platform. Furthermore, accessing data files is critical for executing such applications. Sometimes accessing data becomes...
Applying Simple Performance Models to Understand Inefficiencies in Data-Intensive Computing
New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how these systems perform relative to the capabilities of the hardware on which they run. This paper describes a simple analytical model that predicts the theoretical ideal performance of a parallel dataflow s...
Supporting Large Scale Data-Intensive Computing with the FusionFS Distributed File System
State-of-the-art yet decades-old architecture of HPC storage systems has segregated compute and storage resources, bringing unprecedented inefficiencies and bottlenecks at petascale levels and beyond. This paper presents FusionFS, a new distributed file system designed from the ground up for high scalability (16K nodes) while achieving significantly higher I/O performance (2.5TB/sec). FusionFS ...
Compiler and Runtime Supports for High-Performance, Scalable Big Data Systems
Big Data analytics applications such as social network analysis and web analysis have revolutionized modern computing. The processing demand posed by an unprecedented amount of data challenges both industrial practitioners and academic researchers to design and implement highly efficient and scalable system infrastructures. Unfortunately, Big Data processing is fundamentally limited by memory i...
A survey on impact of cloud computing security challenges on NFV infrastructure and risks mitigation solutions
Increased broadband data rates for end users, and the cost of provisioning resources to an agreed SLA, are forcing telecom operators to employ Virtual Network Functions (VNFs) in an NFV solution. The new 5G mobile telecom technology is also based on NFV and Software Defined Networking (SDN), which inherit the opportunities and threats of such constructs. Thus a ...